The Center for the Study of Evaluation was founded
in June, 1966. It is an educational research and
development center sponsored by the U.S. Office of
Education under the Cooperative Research Act and is
the only federally funded center working exclusively
on problems in educational evaluation.


The mission of the Center is to produce new
evaluation materials, practices, and knowledge which
can be adopted and implemented by educational
agencies. Emphasis is placed on developing procedures
and methodologies needed in the practical
conduct of evaluation studies and on developing
generalizable concepts and approaches to evaluation
problems that are relevant to different levels of
education. The Center is directed by Marvin C. Alkin and
is staffed by an interdisciplinary team which includes
specialists in education, measurement, sociology,
economics, and administration.


Evaluation Comment provides discussion of significant
ideas and controversial issues in the study
of evaluation of educational systems and programs.


A copy of Evaluation Comment is distributed free of
charge to each scholar, researcher, or practitioner
on our mailing list. One to five copies may be obtained
free of charge; however, where greater
quantities are needed readers are encouraged to
reproduce the Comment themselves. To be placed
on our mailing list or to order, subject to availability,
additional copies of Evaluation Comment,
James Burry, Editor

For years various professional organizations in education
and psychology have recognized the need to set
specific criteria for assessment devices. However,
attempts to develop such criteria have been, at best,
timid (viz. Technical Recommendations). This timidity
where "angels dare not tread" may not be completely
reprehensible; it is the result of several factors:


(a) any set of criteria will not be equally appropriate
for all types of measures,

(b) the direct result of the development of such a set
of criteria would be the ability to evaluate critically
all available assessment devices,

(c) the producers of the instruments might not be too
pleased and, worse, might take well-reasoned
issue with the criteria and their developers, and

(d) the authors, being motivated primarily by altruism
and social justice, might have to take their
own inadequate, but lucrative, products off the
market.


The Center for the Study of Evaluation, in order to
provide an equable appraisal of the output measures
published for use in evaluating elementary schools,
programs, and students, developed (1) a comprehensive
objectives-based classification of needs-assessment
areas for elementary education, and (2) a critical test
evaluation procedure to apply to measurement devices
in any of the need areas. Preparatory to the evaluations,
all those measures presently available for elementary
school evaluation at the first, third, fifth, and sixth
grades were located. Each test or sub-scale was assigned
to the pre-established goal area into which it
best fit.

Table 1
OUTLINE OF 145 GOALS OF ELEMENTARY SCHOOL EDUCATION

AFFECTIVE
 1. TEMPERAMENT: PERSONAL
    A. Shyness-Boldness
    B. Neuroticism-Adjustment
    C. General Activity-Lethargy
 2. TEMPERAMENT: SOCIAL
    A. Dependence-Independence
    B. Hostility-Friendliness
    C. Socialization-Rebelliousness
 3. ATTITUDES
    A. School Orientation
    B. Self Esteem
 4. NEEDS AND INTERESTS
    A. Need Achievement
    B. Interest Areas

ARTS-CRAFTS
 5. VALUING ARTS AND CRAFTS
    A. Appreciation of Arts and Crafts
    B. Involvement in Arts and Crafts
 6. PRODUCING ARTS AND CRAFTS
    A. Representational Skill in Arts and Crafts
    B. Expressive Skill in Arts and Crafts
 7. UNDERSTANDING ARTS AND CRAFTS
    A. Arts and Crafts Comprehension
    B. Developmental Understanding of Arts and Crafts

COGNITIVE
 8. REASONING
    A. Classificatory Reasoning
    B. Relational-Implicational Reasoning
    C. Systematic Reasoning
    D. Spatial Reasoning
 9. CREATIVITY
    A. Creative Flexibility
    B. Creative Fluency
10. MEMORY
    A. Span and Serial Memory
    B. Meaningful Memory
    C. Spatial Memory

FOREIGN LANGUAGE
11. FOREIGN LANGUAGE SKILLS
    A. Reading Comprehension of a Foreign Language
    B. Oral Comprehension of a Foreign Language
    C. Speaking Fluency in a Foreign Language
    D. Writing Fluency in a Foreign Language
12. FOREIGN LANGUAGE ASSIMILATION
    A. Cultural Insight through a Foreign Language
    B. Interest in and Application of a Foreign Language

LANGUAGE ARTS
13. LANGUAGE CONSTRUCTION
    A. Spelling
    B. Punctuation
    C. Capitalization
    D. Grammar and Usage
    E. Penmanship
    F. Written Expression
    G. Independent Application of Writing Skills
14. REFERENCE SKILLS
    A. Use of Data Sources as Reference Skills
    B. Summarizing Information for Reference

MATHEMATICS
15. ARITHMETIC CONCEPTS
    A. Comprehension of Numbers and Sets in Mathematics
    B. Comprehension of Positional Notation in Mathematics
    C. Comprehension of Equations and Inequalities
    D. Comprehension of Number Principles
16. ARITHMETIC OPERATIONS
    A. Operations with Integers
    B. Operations with Fractions
    C. Operations with Decimals and Percents
17. MATHEMATICAL APPLICATIONS
    A. Mathematic Problem Solving
    B. Independent Application of Mathematical Skills
18. GEOMETRY
    A. Geometric Facility
    B. Geometric Vocabulary
19. MEASUREMENT
    A. Measurement Reading and Making
    B. Statistics

MUSIC
20. MUSIC APPRECIATION AND INTEREST
    A. Music Appreciation
    B. Music Interest and Enjoyment
21. MUSIC PERFORMANCE
    A. Singing
    B. Musical Instrument Playing
    C. Dance (Rhythmic Response)
22. MUSIC UNDERSTANDING
    A. Aural Identification of Music
    B. Music Knowledge

PHYSICAL EDUCATION - HEALTH - SAFETY
23. HEALTH AND SAFETY
    A. Practicing Health and Safety Principles
    B. Understanding Health and Safety Principles
    C. Sex Education
24. PHYSICAL SKILLS
    A. Muscle Control (Physical Education)
    B. Physical Development and Well-Being (Physical Education)
25. SPORTSMANSHIP
    A. Group Activity - Sportsmanship
    B. Interest in and Independent Participation in Sports and Games
26. PHYSICAL EDUCATION
    A. Understanding of Rules and Strategies of Sports and Games
    B. Knowledge of Physical Education Apparatus and Equipment

READING
27. ORAL-AURAL SKILLS
    A. Listening Reaction and Response
    B. Speaking
28. WORD RECOGNITION
    A. Phonetic Recognition
    B. Structural Recognition
29. READING MECHANICS
    A. Oral Reading
    B. Silent Reading Efficiency
30. READING COMPREHENSION
    A. Recognition of Word Meanings
    B. Understanding of Ideational Complexes
    C. Remembering Information Read
31. READING INTERPRETATION
    A. Inference Making from Reading Selections
    B. [illegible] Devices
    C. [illegible]
32. READING APPRECIATION AND RESPONSE
    A. Attitude toward Reading
    B. Attitude and Behavior Modification from Reading
    C. Familiarity with Standard Children's Literature

RELIGION
33. RELIGIOUS KNOWLEDGE
34. RELIGIOUS BELIEF

SCIENCE
35. SCIENTIFIC PROCESSES
    A. Observation and Description in Science
    B. Use of Numbers and Measures in Science
    C. Classification and Generalization in Science
    D. Hypothesis Formation in Science
    E. Operational Definitions in Science
    F. Experimentation in Science
    G. Formulation of Generalized Conclusions in Science
36. SCIENTIFIC KNOWLEDGE
    A. Knowledge of Scientific Facts and Terminology
    B. The Nature and Purpose of Science
37. SCIENTIFIC APPROACH
    A. Science Interest and Appreciation
    B. Application of Scientific Methods to Everyday Life

SOCIAL STUDIES
38. HISTORY AND CIVICS
    A. Knowledge of History
    B. Knowledge of Governments
39. GEOGRAPHY
    A. Knowledge of Physical Geography
    B. Knowledge of Socio-Economic Geography
40. SOCIOLOGY
    A. Cultural Knowledge
    B. Social Organization Knowledge
41. APPLICATION OF SOCIAL STUDIES
    A. Research Skills in Social Studies
    B. Citizenship
    C. Interest in Social Studies

An outline of the goals is provided in Table 1 above.
The tests and subtests were then evaluated in order to
identify and endorse those output measures most appropriate,
effective, and useful in assessing schools or
students. The evaluation form used throughout the test
evaluations is shown in Figure 1.


The MEAN (an acronym for the four criterion areas
to follow) evaluation procedure critically reflects four
vital areas of concern to test users: Measurement
Validity, Examinee Appropriateness, Administrative
Usability, and Normed Technical Excellence. Twenty-four
separate evaluations, comprising the four major
criterion areas, were performed on 1,649 scales
independently by at least two evaluators. These scales
comprise all the output measures that are prepared for
or are potentially useful for evaluations within the
elementary school and that are generally available to
educators and researchers.


The four criteria comprising the MEAN system are
explained below. They were meant to address the interest
areas of educators and also of educational researchers.
However, the final rating obtained for each test indicates
its appropriateness for school evaluation settings rather
than for clinical or research problems.


Measurement Validity. Evaluations on the criterion
of measurement validity were made in answer to the
question: "Does the test appear to measure the specific
educational objective?" (entry 1 of Table 2). This is
essentially a question of content and face validity, the
validities being keyed to the pre-established goal areas
for elementary education. Trained evaluators were
instructed to judge each test according to its capacity
to assess the particular goal which it purported to
measure or which a plurality of its items appeared to
reflect. The judgments were made on the basis of careful
reading of the items to determine whether they appeared
to assess the goal and whether they proportionately
assessed the whole range of content within the
goal. Such judgments were fairly well structured and
reliable in the content achievement areas, but were
more difficult to make in the non-content areas of
affective and cognitive behaviors. A second aspect of
measurement validity concerned the extent of reported
empirical validation, either predictive or concurrent
(entry 2, Table 2).


Examinee Appropriateness. The second criterion of
the MEAN evaluations was designed to assess how
appropriate the test is for the students who will be
assessed by it. Concern was directed toward the
appropriateness of the test's level of comprehension, its
physical format, and its required response mode.

Evaluation of the appropriateness of test content
centered upon the difficulty of the semantic or numerical
items and also upon the relevance or interest-arousing
aspects of the items (entry 3, Table 2). Similar criteria
were applied to the test instructions since they determine
whether or not the examinee will be able to manifest
his mastery of the item content (entry 4, Table 2).
Instructions which appear simple to adults were often
found to be confusing to young children. The second
major area where appropriateness is felt to be important
is that of test format. The visual or auditory principles
employed in test presentation were evaluated in
terms of effective usage of Gestalt principles (entry 5,
Table 2). The evaluators looked for specific format
features such as sufficiency of white space between
items, visual or auditory coherence of item stems and
alternatives, and effective use of colors as an aid in
segregating items. The general quality of illustrations
and print was also considered under physical format
(entry 6, Table 2).


Figure 1
MEAN TEST EVALUATION FORM

Test Name          Form          Date          Rater

Rating (circle one number in each row)

Evaluation Criteria

1. Measurement Validities
   a. Content and Construct
   b. Concurrent and Predictive
         1 (very little), 3 (not enough), 4 (considerable), 5 (exhaustive)

2. Examinee Appropriateness
   a. Comprehension: content
                     instructions
         2 possibly appropriate, 3 probably appropriate, 4 exactly right
   b. Format
      1. Visual principles
         1 (probably good), 2 (outstanding aids)
      2. Quality of illustrations (print)
         0 (not good), 1 (helpful), 2 (excellent)
      3. Time and pacing
         0 (bad), 1 (appropriate for broad range)
   c. Recording answers
         0 (complicated), 1 (standard), 2 (especially easy)

3. Administrative Usability
   a. Administration
      1. Test administration
         0 (individual), 2 (large groups)
      2. Training of administrators
         0 (psychometrist)
      3. Administration
         0 ([illegible] minutes)
   b. Scoring
         0 (subjective), 1 (difficult)
   c. Interpretation
      1. Norms
         a. Norm range
            0 (restricted), 1 (broad)
         b. Score interpretation
            0 (uncommon, abstruse), 1 (common, simple)
         c. Score conversion
            0 (complicated), 2 (clear, tables)
         d. Norm groups
            0 (local, outdated, or poorly sampled), 1 (national, well sampled)
   d. Score Interpreter
         0 (psychometrist), 1 (school staff)
   e. Can Decisions Be Made
         2 probable, 3 yes, charts and graphs

4. Normed Technical Excellence
   a. Stability
   b. Internal Consistency
   c. Alternate form
         (a, b, and c) 0 not reported or less than .70; .80 to .90
   d. Replicability
   e. Range of Coverage
         0 no information, 1 floor or ceiling reached, 2 adequate,
         3 more than adequate
   f. Gradation of scores
         0 poorly graduated and uncommon, 1 poorly graduated or uncommon,
         2 well graduated and standard


For each scale, pacing or time limits were judged
for their appropriateness for the subject matter and for
the examinees (entry 7, Table 2). Published statements
regarding the speededness of tests were corroborated,
when possible, by consulting item difficulty indexes and
score distributions. In almost all cases, power was
preferred to speed as an attribute of tests of educational
output. The last aspect of appropriateness considered
was the mode of response recording (entry 8, Table 2).
The simpler and more direct the connection between the
item stem and the recording of a response, the more
credit was given. All aspects of examinee appropriateness
were rated relative to the specific grade level to which
the test is directed.


Administrative Usability. After asking "What will it
measure?" and "Is it designed for my students?", the
next question was concerned with how usable the test is
in terms of administration, scoring, interpretation, and
decision making. These aspects of a test comprise the
third criterion of the MEAN evaluations.

It was assumed that for general assessment of educational
output, a test that can be administered to a
large group is more desirable. Small group and individually
administered tests were judged to be less usable
for evaluation of instructional programs (entry 9, Table
2); their usefulness for in-depth individual diagnosis
was not in question. A second variable strongly affecting
a test's utility is the training necessary to administer
the test appropriately (entry 10, Table 2). Since few
schools have resident psychometrists and since most
district psychometrists focus their attentions on individual
student problems, a test was deemed to have greater
utility if it could be administered by the school staff,
preferably by the students' teacher. Tests were also
credited if they fit into a typical class period and did
not necessitate special scheduling (entry 11, Table 2).

The utility of a test is further affected by the scoring
procedure it requires (entry 12, Table 2). Simple and
objective hand or machine scoring of tests was considered
optimal for utility; subjective scoring resulted
in no credit. From a pragmatic viewpoint, while ease
of administration and scoring are desirable, they are
dwarfed by the importance of being able to interpret
the scores and then of reaching some decision (entry 18,
Table 2). Tests from which prescriptive decisions can
be made were given greater credit. Common, simple
scores for interpretation earned a test more credit. In
addition, a broad normative sample (entry 13, Table 2),
which allows for both high and low achievement, was
rated superior to a restrictive sample; a current and
representative norming sample was also rated higher
(entry 16, Table 2).

The normative score conversions were evaluated
according to three criteria. If the derived scale is
common and generally understood, the test was given
more credit (entry 14, Table 2). If the conversion is
clear and unambiguous, the test earned credit over those
with complicated, multi-stage conversions (entry 15,
Table 2). These two aspects of the derived scores determine
in part who can interpret them. Tests yielding
scores interpretable by school staff were preferred to
those demanding the skills of a psychometrist (entry 17,
Table 2). The final pragmatic consideration of a test's
utility rested on whether or not decisions, either individual
or group, can be made on the basis of information
in the test manuals.


Table 2
Mean Ratings of Tests on 24 Evaluative Criteria

Criteria                                  Range  Grade 1   Grade 3   Grade 5   Grade 6
Measurement Validity
 1. Content and face validity             0-10   6.12      6.16      6.07      6.50
 2. Concurrent and predictive validity    0-5    1.00      0.96      [illeg.]  1.26
Examinee Appropriateness
 3. Content comprehension                 0-4    [illeg.]  [illeg.]  3.22      [illeg.]
 4. Instructions comprehension            0-4    3.21      3.22      3.26      3.20
 5. Visual principles of format           0-2    1.01      0.95      0.89      0.84
 6. Quality of illustrations              0-2    1.10      1.04      1.05      1.04
 7. Time and pacing                       0-1    0.95      0.91      0.84      0.86
 8. Response recording                    0-2    1.74      1.55      1.33      1.20
Administrative Usability
 9. Test administration                   0-2    [illeg.]  [illeg.]  1.65      1.80
10. Training of administrators            0-4    0.75      0.81      0.87      [illeg.]
11. Administration                        0-1    0.88      0.86      0.82      [illeg.]
12. Scoring                               0-2    1.56      1.64      [illeg.]  1.72
13. Norm range                            0-1    0.69      0.74      0.82      0.76
14. Score interpretability                0-1    0.84      0.81      0.85      0.85
15. Score conversion                      0-2    1.34      [illeg.]  [illeg.]  1.36
16. Norm representativeness               0-1    0.25      0.22      0.25      0.28
17. Score interpreter                     0-1    [illeg.]  0.74      0.85      0.88
18. Can decisions be made                 0-3    1.32      1.39      1.46      [illeg.]
Normed Technical Excellence
19. Test-retest reliability               0-3    0.15      0.23      0.25      0.24
20. Internal-consistency                  0-3    1.00      0.88      1.21      1.16
21. Alternative form reliability          0-3    0.23      0.35      0.42      0.40
22. Replicability                         0-1    0.90      0.90      0.93      [illeg.]
23. Range of coverage                     0-3    1.43      1.56      1.76      1.80
24. Gradation of scores                   0-2    1.46      [illeg.]  1.58      1.57
Number of Instruments                            318       380       [illeg.]  [illeg.]


Normed Technical Excellence. The last major criterion
of the MEAN evaluation procedure was concerned
with the reliability, replicability, and refinement of
measurement of the tests. Reliability was evaluated
separately for published reports of test-retest (entry 19,
Table 2), internal-consistency (entry 20, Table 2), and
alternate-form estimates (entry 21, Table 2). Closely
related to the concept of test reliability is that of
replicability of procedures to obtain the scores (entry 22,
Table 2). If procedures described in the test's manual
are complicated, subjective, and based upon abnormal
samples, the test is clearly not replicable. Replicable
procedures for obtaining scores were judged as more
valuable.


The range of coverage is also an important aspect of
a test's technical excellence. A broad developmental
range which is appropriate for one level of assessment
but which can also be applied to students above and
below that level was preferred to a restricted range (entry
23, Table 2). Related to the range problem is the refinement
or gradation of the inter-individual comparison
scores; the finer the gradation, the better the
evaluation of the test (entry 24, Table 2).

Each of the tests and scales, then, earned four
scores, one for each of the MEAN criteria. These
scores and their bases are published in greater detail in
CSE Elementary School Test Evaluations, by Hoepfner,
Strickland, Stangel, Jansen, and Patalino (1970).*
The four MEAN scores were, however, based upon
twenty-four individual judgments. These discrete judgments
were factor analyzed in order to uncover the
characteristics of tests which actually do cohere. Table
2 presents the twenty-four criteria, the range of points
possible for each of their evaluations, and the means of
the consensual judgments for grades 1, 3, 5, and 6.
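
The grouping of the twenty-four judgments into four MEAN scores can be sketched as follows. The text does not say how the judgments were combined, so simple summation within each criterion area is an assumption made here for illustration, as are the function name and the example ratings; only the entry groupings (1-2, 3-8, 9-18, 19-24) follow the layout of Table 2.

```python
# Entry numbers grouped by MEAN criterion area, following the layout of Table 2.
MEAN_AREAS = {
    "Measurement Validity": range(1, 3),
    "Examinee Appropriateness": range(3, 9),
    "Administrative Usability": range(9, 19),
    "Normed Technical Excellence": range(19, 25),
}


def mean_area_scores(ratings):
    """Combine 24 entry ratings into four area scores.

    `ratings` maps entry number (1-24) to the rating given.
    Summation within each area is an assumption, not the documented rule.
    """
    missing = [entry for entry in range(1, 25) if entry not in ratings]
    if missing:
        raise ValueError(f"missing ratings for entries: {missing}")
    return {
        area: sum(ratings[entry] for entry in entries)
        for area, entries in MEAN_AREAS.items()
    }


# Hypothetical test that was rated 1 on every one of the 24 entries.
example_ratings = {entry: 1 for entry in range(1, 25)}
example_scores = mean_area_scores(example_ratings)
print(example_scores)
```

Because the areas contain 2, 6, 10, and 6 entries respectively, and the entries have different point ranges, raw area totals under this assumption are not directly comparable across areas.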


The separate judgments for each of the scales within
each of the four grade levels were submitted to a
principal-axes factor analysis. Initial solutions showed that
only four factors appeared with regularity in all four
grade levels. Because a fifth factor appeared in only
two of the solutions (not chronologically adjacent grade
levels), communality iterations were based on four factors.
The matrices of intercorrelations among the
rated characteristics are in A Test of Tests, CSE
Report No. 69, by R. Hoepfner. The varimax
factor loadings for the four factors and for the four
grade levels are presented in Table 3.
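
For readers unfamiliar with varimax rotation, a minimal sketch follows. It uses NumPy and a common SVD-based formulation of the varimax criterion; it is not claimed to be the routine used in the study, and the small loading matrix is hypothetical. The principal-axes extraction step that precedes rotation is not shown.

```python
import numpy as np


def varimax(loadings, max_iter=500, tol=1e-10):
    """Orthogonally rotate a (variables x factors) loading matrix toward
    the varimax criterion. Returns (rotated loadings, rotation matrix)."""
    L = np.asarray(loadings, dtype=float)
    n_vars, n_factors = L.shape
    rotation = np.eye(n_factors)  # accumulated orthogonal rotation
    previous = 0.0
    for _ in range(max_iter):
        rotated = L @ rotation
        # Gradient-like target of the varimax criterion.
        target = rotated**3 - rotated * (rotated**2).sum(axis=0) / n_vars
        u, s, vt = np.linalg.svd(L.T @ target)
        rotation = u @ vt
        current = s.sum()
        if current <= previous * (1 + tol):  # no further improvement
            break
        previous = current
    return L @ rotation, rotation


# Hypothetical simple-structure loadings, deliberately mixed by a 30-degree
# rotation so that every variable loads on both unrotated factors.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.6], [0.0, 0.9]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(np.abs(rotated), 2))
```

The rotation is orthogonal, so each variable's communality (the row sum of squared loadings) is unchanged; only the distribution of loadings across factors is simplified, which is what makes the rotated factors easier to name.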


Mean ratings of evaluative test qualities, as presented
in Table 2, indicated no significant trends of increased
or decreased quality over the four grade levels.
One of the most salient findings in Table 2 is the relatively
higher reliability estimate obtained through
internal-consistency techniques. Whether or not this is
an artifact of the ease of its estimation or the vulnerability
of such estimates to extraneous inflationary
factors cannot be determined.


It can also be seen from Table 2 that publishers
provide very little evidence for the concurrent and
predictive validities of their tests in the manuals they
provide. This reflects, of course, the great costs to the
publisher of such studies and the necessary delay from
the time the manual is published to the time that various
independent research findings can become incorporated
into the publisher's documentation (if, indeed, they ever are).
Nonetheless, the typical rating on this criterion can be
described as "very little evidence."
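
Because the criteria are rated on scales of different lengths, a mean rating is easier to read as a fraction of its scale maximum. The sketch below does this for three Grade 1 rows of Table 2 (entries 2, 16, and 23); the function name is illustrative, and every scale is taken to start at zero, as the Range column indicates.

```python
def fraction_of_maximum(mean_rating, scale_max):
    """Express a mean rating as a fraction of the top of a 0..scale_max scale."""
    if scale_max <= 0:
        raise ValueError("scale_max must be positive")
    return mean_rating / scale_max


# (Grade 1 mean, scale maximum) as printed in Table 2.
grade1_rows = {
    "2. Concurrent and predictive validity": (1.00, 5),
    "16. Norm representativeness": (0.25, 1),
    "23. Range of coverage": (1.43, 3),
}

fractions = {
    name: fraction_of_maximum(mean_rating, scale_max)
    for name, (mean_rating, scale_max) in grade1_rows.items()
}
for name, value in fractions.items():
    print(f"{name}: {value:.2f}")
```

On this reading, concurrent and predictive validity sits at about 0.20 of its scale and norm representativeness at 0.25, while range of coverage reaches about 0.48.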


The comprehension levels of test items and instructions
appear rather satisfactory, all means falling above
the "probably appropriate" rating. This reflects the
fact that most instruments at the elementary level are
developed by curriculum experts at each grade level.
Time and pacing and response recording procedures are
also rated highly, probably for the same reason.


The visual principles and quality of illustrations
for tests are rated at only slightly above average. Such
mediocrity may be due to the expense of good graphics
and layout or may be the result of a deliberate attempt
by some publishers to avoid producing too polished a
product (that might appear more commercial than
educational).


The tests' major shortcomings in the area of Administrative
Usability are the low quality of norm-group
sampling and the failure to provide prescriptive decision
rules on the basis of test results. Maintaining norm
currency and obtaining national representativeness of
the norm groups is the most expensive aspect of test
publishing, and so it is not surprising that norms lack
these qualities. Definitive and prescriptive decision rules
violate the often repeated (and frequently justified)
warnings against too literal and decisive interpretations
from faulty test scores. It seems that in following these
well-intentioned warnings, the publishers make their
instruments less useful for most educators who cannot
operate with the ambiguous decision-making data provided
for them.


* A companion volume, CSE-ECRC Preschool/Kindergarten Test
Evaluations (1971), treats early childhood tests in a similar manner.


Table 3
Varimax Factor Loadings for 24 Criteria for Four Grade Levels

(Loadings of criteria 1-24 on Factors A, B, C, and D at Grades 1, 3,
5, and 6; the numerical entries are illegible in this copy.)


While it is difficult to draw conclusions from the
massive amounts of data provided in the correlation
matrices, the outstanding finding is the relative lack of
correlation between the ratings on the two kinds of test
validity. The correlations between the ratings of face-content
and concurrent-predictive validities range from
-.13 to +.12, clearly demonstrating their independence,
not only as constructs, but as results of actual practice
in test construction and development.
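
The correlations referred to are ordinary product-moment correlations between two columns of ratings. A minimal sketch, using invented ratings for five tests (the source correlation matrices are in the cited report, not reproduced here):

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ratings for five tests (not data from the study).
content_face = [4, 5, 6, 7, 8]   # entry 1, rated on the 0-10 scale
concurrent_pred = [2, 1, 3, 1, 2]  # entry 2, rated on the 0-5 scale

r = pearson_r(content_face, concurrent_pred)
print(round(r, 2))
```

A value near zero, as in this constructed example, means that knowing how well a test's items appear to cover a goal says essentially nothing about how much empirical validation its manual reports.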


The varimax solutions in Table 3 evidence considerable
factorial constancy over the four grade levels. The
fact that some instruments were common to more than
one solution, being appropriate for a large grade span,
cannot be hypothesized as accounting for this invariance,
as there were few such overlapping instruments and
the test evaluations were made separately at each
grade level.
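
One common way to quantify this kind of factorial constancy is Tucker's congruence coefficient between the loading columns of the same factor at two grade levels. The text does not say that any such index was computed, so the sketch below is offered only as an illustration, with hypothetical loadings.

```python
import math


def congruence(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two columns of factor loadings."""
    if len(loadings_a) != len(loadings_b):
        raise ValueError("loading columns must have the same length")
    numerator = sum(a * b for a, b in zip(loadings_a, loadings_b))
    denominator = math.sqrt(
        sum(a * a for a in loadings_a) * sum(b * b for b in loadings_b)
    )
    if denominator == 0:
        raise ValueError("congruence is undefined for an all-zero column")
    return numerator / denominator


# Hypothetical loadings of four criteria on one factor at two grade levels.
grade_1 = [0.9, 0.8, 0.1, 0.0]
grade_3 = [0.8, 0.9, 0.0, 0.1]

phi = congruence(grade_1, grade_3)
print(round(phi, 3))
```

Values close to 1 indicate that the same criteria carry the factor at both grade levels; values near 0 indicate unrelated loading patterns.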


Factor A, consistently led by the variables of Test
Administration, Training of Administrators, Score
Interpreter, Scoring, and Replicability, clearly reflects
a "Usability" dimension upon which tests can be placed.
While not the same as the MEAN criterion of administrative
usability, it is related, as four of the eight variables
having significant loadings are components of that
criterion. It is interesting to note the consistent negative
loadings for the Examinee Appropriateness ratings,
especially for Visual Principles and Quality of Illustrations;
perhaps this indicates that increased efforts to
make tests usable have resulted in decreased attempts
at making tests appropriate for the examinees.


Factor B is consistently led by the variables of Range
of Coverage, Gradation of Scores, Norm Range, Score
Interpretation, Score Conversion, and Internal-Consistency
Reliability. This constellation of test attributes is
named the "Norm Quality" factor, implying that normed
tests tend to be good or bad in most of the norming
attributes.


Factor C is led in all four grade levels by the variables
of Ability to Make Decisions, Content and Construct
Validity, and Content Comprehension. The factor probably
reflects the amount of specificity of coverage of a
test; tests being directed specifically to some focal
goal area scored higher on these criteria. For this
reason, Factor C is called the "Focus" factor.


Factor D is led by the variables of Concurrent and
Predictive Validity, Norm Representativeness, and Test-Retest
Reliability. In several of the grade levels, the
factor is further supported by the variables of Internal-Consistency
and Alternate-Form reliabilities. This
factor is parallel to Factor B and is called the
"Psychometric Quality" factor. Apparently, publishers either
exhaustively analyze their tests on most psychometric
criteria, tend not to analyze on any of the criteria, or
seek some consistent level of psychometric analysis.


Mean ratings of evaluations of tests, as presented
in Table 2, indicate major shortcomings that characterize
today's published instruments for elementary
education. A factor analysis of these ratings revealed
four consistent dimensions upon which tests actually
vary: Usability, Norm Quality, Focus, and Psychometric
Quality. The results of this analysis of tests
should have many immediate and long-term implications
for the improvement of assessment instrumentation by
pointing out rather clearly some of the shortcomings
that characterize today's published tests.


